Citation Matching in Sanskrit Corpora Using Local Alignment

Authors

  • Abhinandan S. Prasad
  • Shrisha Rao
Abstract

Citation matching is the problem of finding which citation occurs in a given textual corpus. Most existing citation matching work has been done on scientific literature. The goal of this paper is to present methods for performing citation matching on Sanskrit texts. Two methods are considered: exact matching and approximate matching. The exact-matching method checks whether the citation occurs verbatim in the textual corpus. Approximate matching is a fuzzy string-matching method which computes a similarity score between an individual line of the textual corpus and the citation. The Smith-Waterman-Gotoh algorithm for local alignment, which is generally used in bioinformatics, is used here for calculating the similarity score. This similarity score is a measure of the closeness between the text and the citation. The exact- and approximate-matching methods are evaluated and compared. The methods presented can be easily applied to corpora in other Indic languages such as Kannada and Tamil. The approximate-matching method can in particular be used in the compilation of critical editions and in plagiarism detection in literary works.
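The two methods described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the scoring parameters (`match`, `mismatch`, `gap_open`, `gap_extend`), the character-level comparison, and the normalization by the shorter string's length are all assumptions chosen for illustration. Only the general technique, exact substring search plus local alignment with affine gap penalties in the style of Smith-Waterman-Gotoh, comes from the abstract.

```python
def exact_match(citation, corpus_lines):
    """Return indices of corpus lines that contain the citation verbatim."""
    return [i for i, line in enumerate(corpus_lines) if citation in line]


def swg_score(a, b, match=2.0, mismatch=-1.0, gap_open=2.0, gap_extend=0.5):
    """Best local-alignment score of a and b with affine gap penalties.

    Gotoh-style recurrences; all parameter values are illustrative assumptions.
    gap_open is charged for the first gap character, gap_extend for each
    further character in the same gap.
    """
    neg_inf = float("-inf")
    prev_h = [0.0] * (len(b) + 1)      # H: best score ending at (i-1, j)
    prev_f = [neg_inf] * (len(b) + 1)  # F: best score ending in a vertical gap
    best = 0.0
    for i in range(1, len(a) + 1):
        cur_h = [0.0] * (len(b) + 1)
        cur_f = [neg_inf] * (len(b) + 1)
        e = neg_inf                    # E: best score ending in a horizontal gap
        for j in range(1, len(b) + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open)
            cur_f[j] = max(prev_f[j] - gap_extend, prev_h[j] - gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores never drop below zero.
            cur_h[j] = max(0.0, prev_h[j - 1] + sub, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


def similarity(citation, line, match=2.0, **kwargs):
    """Score scaled to [0, 1] by the best possible score of the shorter string.

    This normalization is an assumption, not taken from the paper.
    """
    shorter = min(len(citation), len(line))
    if shorter == 0:
        return 0.0
    return swg_score(citation, line, match=match, **kwargs) / (match * shorter)


def best_match(citation, corpus_lines):
    """Return (index, similarity) of the corpus line closest to the citation."""
    scores = [similarity(citation, line) for line in corpus_lines]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]
```

Under these assumptions, `exact_match` finds only verbatim occurrences, while `best_match` still ranks a line highly when the citation differs from it by a few inserted, deleted, or altered characters, which is the situation the approximate method is meant to handle. A threshold on the similarity score, chosen empirically, would decide whether the best line counts as a match.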


Similar resources

Sentence Alignment for Monolingual Comparable Corpora

We address the problem of sentence alignment for monolingual corpora, a phenomenon distinct from alignment in parallel corpora. Aligning large comparable corpora automatically would provide a valuable resource for learning of text-to-text rewriting rules. We incorporate context into the search for an optimal alignment in two complementary ways: learning rules for matching paragraphs using topic ...


An Approach for Grammatical Constructs of Sanskrit Language using Morpheme and Parts-of-Speech Tagging by Sanskrit Corpus

Sanskrit since many thousands of years has been the oriental language of India. It is the base for most of the Indian Languages. Statistical processing of Natural Language is based on corpora (singular corpus). Collection of texts of the written and spoken words is known as Language corpus, which is collected in an organized way, in electronic media for the purpose of linguistic research. It pr...


Comparing Sanskrit Texts for Critical Editions

Traditionally Sanskrit is written without blank, sentences can make thousands of characters without any separation. A critical edition takes into account all the different known versions of the same text in order to show the differences between any two distinct versions, in term of words missing, changed or omitted. This paper describes the Sanskrit characteristics that make text comparisons di...


Extracting Citation Metadata from Online Publication Lists Using BLAST

Scientific research reports require a great deal of citation, therefore an automatic citation tool would be of great use. Due to numerous models, it is difficult to automatically transform semi-structured citation data into structured citations. Some digital library institutes, like ResearchIndex (CiteSeer) or OpCit, have attempted automatic citation parsing. In order to recognize citation meta...


Social Network Relational Vectors for Anonymous Identity Matching

Anonymous fraudulent behavior can generate substantial financial burden and inconvenience. Moreover, the recent threat of terrorist infiltration to both business and government has yielded heightened interest in anonymous identity matching (AIM). Most applications of AIM require sophisticated methods robust to issues such as deliberate variation in identity attributes, missing data, and multi-s...



Publication date: 2010